Internet scams have been a major concern for everyone over the past decade.
With the advancement of technology, attackers have devised many new
fraudulent schemes to obtain users' sensitive information.
Phishing is one of the oldest and most common of these schemes: every year,
millions of internet users fall victim to phishing scams and lose money.
Researchers have proposed different techniques and algorithms for detecting
phishing websites. However, detection remains challenging because the process
involves subjective considerations and ambiguities. This paper presents a
two-stage probabilistic method for detecting phishing websites based on the
vote algorithm. In the first stage, 29 different base classifiers were used and their
probabilistic values were calculated. In the second stage, the voting algorithm
aggregated the probabilistic values of several base classifiers, and phishing
websites were detected using the average of probabilities approach. The voting
technique achieved an accuracy of 97.431%, outperforming all of the single base
classifiers in terms of accuracy.
1. INTRODUCTION

Phishing is the deceitful use of electronic communications to mislead and exploit users. It
can be defined as a criminal mechanism for stealing a user's personal information, such as usernames,
passwords, and financial account details (for example, credit card information). Phishing is popular among
attackers because it is simple to target an individual by analyzing their behavior and preferences, which can
be done by following them on social networking sites and then personalizing phishing sites or spoofed emails
based on that analysis. The attack usually starts with the victim receiving a message that contains malicious
software such as a wheel of fortune or a quiz game. Attackers often use this type of application to lure
victims by offering them money, gift cards, free coupons, or exclusive items. An individual may not recognize
that they are browsing a phishing site and can easily fall victim to it.
As online business and e-commerce grow rapidly, users become more vulnerable to phishing. A research
study by Verizon shows that 30% of phishing messages or spoofed emails are opened by the targeted
individual [1]. A study by AVANAN (a cloud security platform) shows that 51% of phishing attacks contain
malicious links. Statistics show that the average financial loss from a breach of confidential data is
$3.86 million [2]. The "2018 Internet Crime Report" from the Internet Crime Complaint Center (IC3) indicates
that there were 26,379 victims of phishing/vishing/smishing attacks in 2018, with reported losses totaling
$48,241,748 [3]. In fact, nearly 86% of all phishing was targeted at U.S. entities alone [4], which makes the
U.S. the most vulnerable country for phishing. Although phishers use several kinds of techniques, most
phishing websites share some common attributes, such as a redirecting link, a prefix or suffix, and an HTTP
token in the uniform resource locator (URL). By analyzing a total of 30 attributes, in this paper we propose a
machine learning approach using the vote algorithm, which aggregates multiple base classifiers to detect
phishing websites.

Researchers have presented various methodologies for detecting phishing websites, and we took
cues from this prior research. Jain and Gupta [5] presented visual similarity-based techniques that identify
phishing websites by analyzing different feature sets. They analyzed URL features, hypertext markup
language (HTML) tags, cascading style sheets (CSS), and images to distinguish a phishing website from a
legitimate one. The work also analyzed different phishing methods and their exploitation. Ali [6] used a
wrapper-based feature selection technique in combination with machine learning classifiers to detect
phishing websites. This work demonstrated that wrapper-based feature selection improved the overall accuracy
of the classifiers. The research was conducted using 7 different machine learning classifiers, among which the
random forest classifier achieved the best accuracy of 97.1%. However, wrapper-based feature selection
may require more time and add computational overhead with some classifiers. Yang
et al. [7] proposed a multidimensional feature-driven phishing detection technique using deep learning
methods. In the first step, they extracted the character sequence features of the URL; later, they combined the
URL statistical features, webpage text features, and the classification result into multidimensional features to
identify a phishing website. They achieved an accuracy of 98.99% on random
URLs from the internet. Karabatak and Mustafa [8] used different classifiers on a reduced dataset
to detect phishing websites. Starting from the dataset [9], instead of using all 30 attributes they reduced it
to 24-27 attributes using various feature selection algorithms. They achieved the highest accuracy of 97.58%
using the Lazy KStar classifier on a reduced dataset of 26 attributes. However, no comparison was provided
of the time required to perform classification on the reduced dataset. Pan and Ding
[10] used the support vector machine (SVM) technique to detect phishing web pages. Using the keywords,
request URL, server form handler, and main body of a web page, they tried to determine whether or not the
page is a legitimate site, achieving a success rate of 84%. James et al. [11] used various
machine learning classifiers to detect phishing websites by analyzing the URL. They collected website URLs
from Alexa, DMOZ, and PhishTank. After analyzing the lexical features of the URLs and using a 90% test data
split, they achieved a maximum accuracy of 93.78% using the J48 decision tree algorithm.

Mhaske-Dhamdhere and Vanjale [12] proposed the K-means algorithm for detecting phishing emails.
Taking 160 emails, they used K-means to distinguish between phishing emails and legitimate emails in
real time. Wardman and Warner [13] proposed an automatic phishing website detection technique
using the message-digest algorithm (MD5). After downloading all the files from a phishing URL and using the
MD5 database provided by the Digital PhishNet (DPN), they matched the MD5 checksum against the URL's
homepage. Using this technique they were able to identify 30% of phishing websites by matching only the main
HTML MD5. Mohammad et al. [14] proposed a rule-based phishing website detection method that imposes
rules on the dataset attributes that can define a phishing website. They studied the minimum set of features
that can be utilized to detect phishing websites. In the initial phase, their proposed method achieved an
average error rate of 5.76%. Later, using reduced feature sets, they achieved an accuracy of 95.25%. Several
studies [15]-[17] have suggested that URLs are the key attribute for easily detecting phishing websites. Kumar
et al. [18] proposed a hybrid methodology that combines an SVM with a probabilistic neural network model to
identify phishing emails. Identification of malicious JavaScript-based code has also been discussed [19].
Following a thorough examination of these works, we used multiple feature sets in our dataset, which includes
30 attributes, and aggregated various algorithms using the voting technique to identify phishing websites with
high precision.
2. DATA PREPARATION

We collected the phishing website dataset from the UCI machine learning repository [9]. The dataset
contains 11,055 instances with 30 different attributes. Of these, 4,898 instances are phishing
websites and 6,157 instances are legitimate websites. We applied a feature selection method [20] to the
attributes and grouped them according to their similarities. Table 1 shows the feature groups created from the
phishing website dataset attributes. The feature groups summarize the key attributes that help in identifying
a phishing website. Each attribute represents phishing characteristics in a unique way. Further details on
these feature groups can be found in the work by Mohammad et al. [14].


Table 1. Feature groups of phishing website dataset attributes
Feature group               Attributes
URL based features          1. Having IP address, 2. URL length, 3. Shortening service, 4. Having at symbol,
                            5. Double slash redirecting, 6. Prefix suffix, 7. Having sub domain,
                            8. SSLfinal state, 9. Domain registration length, 10. Favicon, 11. Port,
                            12. HTTPS token, 13. Request URL
JavaScript based features   14. Redirect, 15. On mouseover, 16. RightClick, 17. popUpWindow, 18. Iframe
Anomaly based features      19. URL of anchor, 20. Links in tags, 21. SFH, 22. Submitting to email,
                            23. Abnormal URL, 24. Links pointing to page
Statistics based features   25. Age of domain, 26. DNSRecord, 27. Web traffic, 28. Page rank,
                            29. Google index, 30. Statistical report


2.1. URL based feature

URLs can provide a lot of information about a webpage. We take into account 13 attributes in
the URL-based feature group that indicate a phishing website. The features include an IP address in place of
a domain name in the URL, a long URL that can potentially hide links inside it, URL shortening services like
"Bitly" or "TinyURL", a URL containing the @ symbol that can potentially submit the information to an email,
redirecting using a double slash "//", a prefix or suffix in the URL, no secure sockets layer (SSL) final state,
a short domain registration length, an uncommon port, any subdomain, an HTTPS token in the URL,
and any request URL. Each of these strongly indicates that the website is unauthorized.
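
As an illustration of how a few of these URL-based indicators could be computed, the following is a minimal Python sketch, not the extraction procedure used to build the dataset. The function name `url_features`, the length threshold, the shortener host list, the position cut-off for the double slash, and the interpretation of "prefix or suffix" as a hyphen in the host name are all assumptions introduced here for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical threshold: the text only says that "long" URLs are suspicious.
LONG_URL_THRESHOLD = 75

# Hypothetical, non-exhaustive list of URL shortening hosts.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}

# Matches a dotted-quad host such as 125.98.3.123 (no range check on octets).
IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Return a few URL-based indicators as booleans (True = suspicious)."""
    host = urlparse(url).hostname or ""
    # In an ordinary URL the only "//" follows the scheme (index 5 or 6).
    last_double_slash = url.rfind("//")
    return {
        "having_ip_address": bool(IPV4_PATTERN.match(host)),
        "long_url": len(url) > LONG_URL_THRESHOLD,
        "shortening_service": host in SHORTENER_HOSTS,
        "having_at_symbol": "@" in url,
        # Assumed cut-off: a "//" later than position 7 suggests a redirect.
        "double_slash_redirecting": last_double_slash > 7,
        # Assumed interpretation: a hyphen separating a prefix or suffix in the host.
        "prefix_suffix": "-" in host,
        # Assumed interpretation: the token "https" appearing inside the host name.
        "https_token": "https" in host,
    }


if __name__ == "__main__":
    for example in (
        "http://125.98.3.123/fake.html",
        "http://www.legit.com//http://www.phish.com",
        "https://example.com/path",
    ):
        print(example, url_features(example))
```

The remaining URL-based attributes (SSL state, registration length, port, favicon, request URL) need network or page content access and are left out of this sketch.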


2.2. Anomaly based feature
In anomaly-based features, we take into account 6 attributes that indicate a phishing website. The
features include anchor URLs that point to a different domain, links in tags, a server form handler (SFH) that is
either empty or "about:blank", submitting information to email, an abnormal URL in which the host name is
absent, and links pointing to the page. Each of these strongly indicates that the website is unauthorized.


2.3. JavaScript based feature

JavaScript is a scripting language used on the client side of a website. Developers
use JavaScript to make web pages interactive and animated. When a user sends a request on a
JavaScript-enabled page, the script is sent to the browser to process the request. Attackers exploit these
capabilities to deceive users by adding JavaScript to a phishing web page and making it look authentic. In
JavaScript-based features, we take into account 5 attributes that indicate a phishing website. The features
include web page redirecting, using onmouseover to hide a link, disabling right click, showing a pop-up
window, and Iframe redirecting. Each of these indicates that the website is unauthorized.


2.4. Statistics based feature

In statistics-based features, we take into account 6 different attributes to detect a phishing website.
These attributes mainly correspond to statistical analysis. The features include a domain age of less than
6 months, no DNS record, low web traffic, a low page rank, a low Google index score, and the lack of a
statistical report. Each of these suggests that the website is unauthorized.


3. PROPOSED METHOD 

We employed a two-stage probabilistic model to detect phishing websites more
accurately by minimizing the variance error. In the first stage, we calculated the probabilistic values given by
the individual base classifiers for each output class. In the second stage, we took the probabilistic values given
by each base classifier and used the voting algorithm to aggregate them. The vote algorithm combines
multiple base classifiers and makes its decision using their output probabilities.

Different kinds of voting techniques are available, such as majority voting, average of probabilities,
product of probabilities, median, minimum probability, and maximum probability [21]. The vote algorithm can
be used with any kind of class attribute, such as binary, nominal, date, and numeric classes. In this study, we
employed the average of probabilities voting algorithm on our binary class phishing website dataset. In the
average of probabilities approach, the algorithm takes the probabilities output by every individual base
classifier and averages them to obtain the net probability. If the output probabilities of the base classifiers
are considered independent of each other, averaging them helps reduce the variance error that could be
caused by a single base classifier. After the net average probability is computed, the class label is assigned
to the class with the maximum probability. Since there are only two class labels in our dataset, the voting
algorithm simply calculates the probabilities of every base classifier in the first stage, then averages the
probabilities in the second stage and predicts the class label. Figure 1 shows the flow diagram of our
proposed method.


[Flow diagram: data collection process (load phishing website dataset, feature selection); Stage 1 (base
classifiers output Probability(1), Probability(2), Probability(3), ..., Probability(n)); Stage 2 (vote algorithm
averages the probabilities: sum of probabilities divided by the number of classifiers); classification process
(phishing website if the average probability of phishing is greater than 50%, otherwise legitimate website)]

Figure 1. A method of detecting phishing website using voting algorithm


We combined multiple base classifiers using the vote algorithm. For each classifier, we obtained a
probabilistic value for the class label phishing website. Equation (1) shows the sum of the probabilities given by
all the base classifiers for the class label -1, which denotes a phishing website in our dataset.

\[ \sum P_{phishing} = P_{phishing}(1) + P_{phishing}(2) + \dots + P_{phishing}(n) \tag{1} \]


The algorithm then averages the probabilities by dividing their sum by the number of classifiers used.
Equation (2) shows the calculation of the average probability for phishing, where n denotes the number of
classifiers used.

\[ \mathrm{Avg}(P_{phishing}) = \frac{\sum P_{phishing}}{n} \tag{2} \]

We compare the average probability with 50% because our dataset has binary class labels.
When the average probability of a phishing website is greater than 0.5, we assign the class label phishing
website. Conversely, when the average probability of a legitimate website is greater than 0.5, we assign the
class label legitimate website.

\[ \text{if } \mathrm{Avg}(P_{phishing}) > 0.5, \text{ then class label} = \text{phishing} \]
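
The averaging and thresholding steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Weka or RapidMiner implementation used in the experiments; the function names and the example probabilities are hypothetical, and a tie at exactly 0.5 is resolved as legitimate here because the rule only assigns the phishing label when the average is strictly greater than 0.5.

```python
def average_of_probabilities(phishing_probs):
    """Stage 2: average the phishing probabilities output by the base classifiers."""
    if not phishing_probs:
        raise ValueError("at least one base classifier probability is required")
    return sum(phishing_probs) / len(phishing_probs)


def vote(phishing_probs, threshold=0.5):
    """Assign the class label by comparing the average probability with the threshold."""
    average = average_of_probabilities(phishing_probs)
    return "phishing" if average > threshold else "legitimate"


if __name__ == "__main__":
    # Hypothetical stage-1 outputs of three base classifiers for one website.
    print(vote([0.90, 0.40, 0.65]))
    # Two of three classifiers lean legitimate, but one is very confident,
    # so the averaged decision can differ from a simple majority vote.
    print(vote([0.95, 0.45, 0.45]))
```

The second call illustrates the design difference from majority voting: averaging lets a confident classifier outweigh two weakly opposed ones.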


Now, assuming that the errors of the base classifiers are independent of each other, for n individual
observations $P_1, P_2, P_3, \dots, P_n$, each having variance $\sigma^2$, the mean variance error is given by (3).

\[ \text{Mean variance error} = \frac{\sigma^2}{n} \tag{3} \]

Here, the mean variance error of the voting algorithm can be smaller than the variance error of any single
base classifier. Thus, in several cases, the voting algorithm reduces the variance error of the individual base
classifiers, resulting in better overall accuracy.
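
The variance reduction claimed above can be checked numerically. The sketch below is a small simulation under the stated independence assumption; the Gaussian error model, the function name, and the chosen values of n and sigma are assumptions made only for illustration. It draws n independent errors with variance sigma squared, averages them, and estimates the variance of that average, which should come out close to sigma squared divided by n.

```python
import random


def variance_of_average(n, sigma, trials=20000, seed=0):
    """Estimate the variance of the mean of n independent errors with std sigma."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n)]
        averages.append(sum(errors) / n)
    centre = sum(averages) / trials
    return sum((a - centre) ** 2 for a in averages) / trials


if __name__ == "__main__":
    sigma = 0.1
    for n in (1, 3, 10):
        estimate = variance_of_average(n, sigma)
        print(f"n={n}: simulated {estimate:.5f}, sigma^2/n = {sigma ** 2 / n:.5f}")
```

If the base classifiers' errors are correlated, as is often the case in practice, the reduction is smaller than this idealized figure.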


4. RESULTS ANALYSIS 
4.1. Classification performance 

We employed the machine learning tools Weka [22] and RapidMiner [23] for the classification of the
phishing websites. The experiment was carried out on a system with a GeForce GTX 1060 graphics card and 16
GB of RAM. We used 29 different base classifiers with ten-fold cross-validation to evaluate the performance
of each classifier on raw data. The classification accuracy of our experiment is shown in Figure 2.
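
For readers unfamiliar with ten-fold cross-validation, the following is a minimal Python sketch of how the instance indices could be partitioned. It is a simplified illustration and not the procedure implemented by the tools above, which may additionally stratify the folds by class; the function name and the seed are hypothetical.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 11,055 is the number of instances in the dataset described in Section 2.
    for fold_number, (train, test) in enumerate(k_fold_indices(11055), start=1):
        print(f"fold {fold_number}: {len(train)} training, {len(test)} test instances")
```

Each classifier is trained on nine folds and evaluated on the remaining one, and the ten evaluation results are then combined.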

From Figure 2, we observe that random forest [24] achieved the highest accuracy among the single
base classifiers in the first stage. Random committee [25], Lazy KStar [26], and IBK (k-nearest neighbor) [27]
also achieved an accuracy of more than 97%. We therefore discarded all classifiers with less than 97%
accuracy and considered only the base classifiers that achieved more than 97% accuracy in the second stage.
Along with classification accuracy, we took into account the receiver operating characteristic (ROC) and the
time complexity of the base classifiers. In the second stage, we compared the accuracy, ROC, and time
complexity of the base classifiers and combined them in different combinations using the voting algorithm
to calculate the net probability for the binary class. Table 2 shows the confusion matrix of phishing website
classification. From the confusion matrix we can observe the true positive rate, true negative rate, false
positive rate, false negative rate, and accuracy of the classifier. The accuracy is calculated using the formula

\[ \text{Accuracy}\ (\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \]
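
As a worked illustration of the accuracy formula and the related rates, the sketch below computes them from the four confusion matrix cells. The counts in the example are made-up toy values chosen for easy arithmetic, not results from this study, and the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy (%), precision, recall and false positive rate."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    return {
        "accuracy_percent": 100.0 * (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called the true positive rate
        "false_positive_rate": fp / (fp + tn),
    }


if __name__ == "__main__":
    # Toy counts (hypothetical): 100 phishing and 100 legitimate websites.
    metrics = confusion_metrics(tp=90, fn=10, fp=5, tn=95)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With these toy counts, 185 of 200 websites are classified correctly, so the accuracy is 92.5%.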

Considering the time constraint, we observed that random committee performed best, completing
10-fold cross-validation in 1.57 seconds. Random forest and IBK performed similarly, taking 10.54 seconds
and 9.60 seconds respectively. Lazy KStar took the longest, 348.67 seconds on our machine for 10-fold
cross-validation while predicting the phishing websites, which is impractical for a large dataset. Therefore,
we excluded Lazy KStar from the voting technique. The results of the vote algorithm on the pre-selected
classifiers are shown in Table 3.
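
The selection procedure described above (keep classifiers above 97% accuracy, then drop those that are too slow) can be sketched as a simple filter. The accuracy and time values below are the single-classifier figures reported in Table 3; the 60-second time limit is a hypothetical cut-off introduced only for this sketch, since the text excludes Lazy KStar without stating a numeric limit.

```python
# (accuracy in %, time in seconds) for the single base classifiers in Table 3.
BASE_CLASSIFIERS = {
    "Random forest": (97.259, 10.54),
    "Random committee": (97.241, 1.57),
    "Lazy KStar": (97.196, 348.67),
    "IBK": (97.178, 9.60),
}


def select_for_voting(results, min_accuracy=97.0, max_seconds=60.0):
    """Return the names of classifiers that are accurate enough and fast enough."""
    return sorted(
        name
        for name, (accuracy, seconds) in results.items()
        if accuracy > min_accuracy and seconds <= max_seconds  # max_seconds is assumed
    )


if __name__ == "__main__":
    print(select_for_voting(BASE_CLASSIFIERS))
```

Under these assumptions the filter keeps IBK, random committee, and random forest, which are the three classifiers combined by the vote algorithm.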

Based on the results reported in Table 3, we can clearly observe that every combination of the vote
algorithm outperformed every single base classifier in terms of accuracy. First, we considered 3
base classifiers (random forest, random committee, and IBK) and combined them using the vote algorithm.
This combination achieved the maximum accuracy of 97.431% with a time of 21.71 seconds. We then considered
combinations of 2 base classifiers and compared their accuracy. The combination of random committee
and IBK achieved an accuracy of 97.359% with a time of 10.17 seconds, and the combination of random forest
and IBK achieved an accuracy of 97.332% with a time of 20.14 seconds. Among the single base classifiers,
random forest achieved the highest accuracy on 10-fold cross-validation.
Figure 2. Performance of different classifiers on raw phishing website dataset


Table 2. Confusion matrix of phishing website classification
                     Predicted phishing     Predicted legitimate
Actual phishing      True positive (TP)     False negative (FN)
Actual legitimate    False positive (FP)    True negative (TN)


Table 3. Classification accuracy, confusion matrix, ROC and time needed for pre-selected classifiers
(results only for raw sample dataset, sorted by accuracy in descending order)


Classifier                                       Accuracy (%)   Precision   Recall   ROC     Time (s)
Vote (random forest + IBK + random committee)    97.431         0.974       0.974    0.996   21.71
Vote (random committee + IBK)                    97.359         0.974       0.974    0.993   10.17
Vote (random forest + IBK)                       97.332         0.973       0.973    0.996   20.14
Random forest                                    97.259         0.973       0.973    0.996   10.54
Random committee                                 97.241         0.972       0.972    0.992   1.57
Lazy KStar                                       97.196         0.972       0.972    0.997   348.67
IBK                                              97.178         0.972       0.972    0.989   9.60


The confusion matrices of the vote algorithm and the other classifiers are shown in Figures 3(a)-3(f).
By comparing the confusion matrix of random forest in Figure 3(d) with the matrices of the vote method in
Figures 3(a)-3(c), we can observe that the vote algorithm reduced the number of false positive and
false negative occurrences, resulting in a lower error rate. The same holds for random committee
and IBK.
Considering the area under the ROC curve (AUC) covered by the classifiers, the vote algorithm
performs considerably well alongside the other classifiers. Figure 4 shows the ROC curves of the different
algorithms. The ROC curve for the voting method is nearly a perfect curve, with an AUC of 0.996.
The ROC curve for the ZeroR method is the lowest, with an AUC of 0.902.


[Confusion matrix heat maps, panels (a)-(f); both axes labeled Phishing and Legitimate]

Figure 3. Confusion matrix of different classifiers (a) vote (random forest+IBK+random committee),
(b) vote (random committee+IBK), (c) vote (random forest+IBK), (d) random forest, (e) random committee,
and (f) IBK


[ROC plot: true positive rate versus false positive rate for vote (random forest + IBK + random committee),
Lazy KStar, random forest, IBK, and random committee]

Figure 4. ROC curve of different algorithms on phishing dataset


4.2. Discussion and findings 
After analyzing the overall results, we obtained some interesting findings in our study. The
findings are as follows:

- In multiple instances, the vote algorithm reduced the false positive and false negative instances, resulting
in higher accuracy than the single base classifiers. However, the voting technique required more time to
perform the classification task than the single base classifiers.

- Lazy KStar achieved the maximum ROC, but it also took a considerably long time to perform the
classification task. Hence, there is a trade-off between the time and the ROC of the base
classifiers.

- Random committee took the minimum time to perform the classification task yet provided an accuracy
level similar to that of the voting algorithm. Hence, random committee should be preferred over the voting
algorithm when a faster classification process is needed.

- When time is not a constraint, the vote algorithm should be preferred for the classification
task, since it results in higher overall accuracy.


Indonesian J Elec Eng & Comp Sci, Vol. 28, No. 3, December 2022: 1582-1591 




Various authors have used different approaches to phishing website detection. A statistical
comparison between different phishing detection techniques and our proposed model is shown in Table 4.
As Table 4 shows, in terms of accuracy the vote algorithm (97.431%) outperformed
the wrapper-based machine learning technique proposed by Ali [6] (97.1%) on the same dataset. Also, without
reducing the number of attributes, the vote algorithm achieved a level of accuracy similar to the result
reported by Karabatak and Mustafa [8]. Comparing the accuracy, precision, recall, ROC, and time complexity,
we conclude that the vote algorithm reduced the variance error of different single base classifiers and
performed better in identifying phishing websites accurately.


Table 4. Comparison between existing phishing detection approaches with our proposed technique


Author                      Approach                                                       Dataset used                                       Accuracy
Ali [6]                     Wrapper based feature selection approach                       UCI machine learning repository phishing dataset   97.1%
Yang et al. [7]             Deep learning based multidimensional feature driven approach   Random URLs from the internet                      98.99%
Karabatak and Mustafa [8]   Reduced feature selection based approach                       UCI machine learning repository phishing dataset   97.58%
Pan and Ding [10]           DOM object anomalies based anti-phishing approach              Random URLs from the internet                      84%
James et al. [11]           Lexical feature based approach                                 URLs from Alexa, DMOZ, and PhishTank               93.78%
Mohammad et al. [14]        Intelligent rule-based approach                                URLs from PhishTank and Millersmiles               95.25%
Proposed model              Vote algorithm based approach                                  UCI machine learning repository phishing dataset   97.431%


5. CONCLUSION 

In the age of the internet, cyber security is a major concern for everyone. Phishing is a prevalent type of
cyber attack that everyone should be aware of in order to stay safe. In this study, a two-stage probabilistic model
based on the vote algorithm has been proposed for detecting phishing websites. First, we performed
classification using 29 different base classifiers on the phishing website dataset taken from the UCI machine
learning repository. Based on the results of the 29 base classifiers, we selected the four base classifiers with
more than 97% accuracy. After analyzing the confusion matrix, ROC area, and time required to complete 10-fold
cross-validation for the selected classifiers, we discarded the Lazy KStar algorithm because of its long run
time. We aggregated the other three base classifiers using our proposed vote algorithm.

The classification results indicate that the voting method reduces the false positive and false negative
instances of single base classifiers for every combination of base classifiers we tested, thus reducing the error
rate. Combining three base classifiers, the vote algorithm achieved a maximum accuracy of 97.431%,
outperforming all single base classifiers in terms of accuracy. However, the voting technique takes longer than
single base classifiers to perform classification. Our experiment was conducted on raw data without any filter
or data segmentation. The accuracy may be increased further by applying filters or data segmentation to the
raw data. In the future, we plan to integrate our proposed vote algorithm based phishing detection method into
a browser extension that will detect phishing websites or phishing links in real time.